Does topography have an effect on the likelihood of car break-ins? This regression analysis seeks to provide insights into this question using data from San Francisco, which has the ideal geography for this context.

That is one steep street!
The specific analysis plan steps to address this research question are found in Table 1 below.
data.frame("Step" = c(1:9),
"Description" = c("Identify research question",
"Select key variables (based on existing literature)",
"Collect data",
"Merge data",
"Data description",
"Visualize relationships",
"Test OLS assumptions",
"Conduct regression analysis",
"Interpret results")) %>%
kable(caption = "Table 1. Analysis Plan") %>%
kable_paper(full_width = FALSE) %>%
column_spec(1, bold = T) %>%
row_spec(0, bold = T, color = "black")
| Step | Description |
|---|---|
| 1 | Identify research question |
| 2 | Select key variables (based on existing literature) |
| 3 | Collect data |
| 4 | Merge data |
| 5 | Data description |
| 6 | Visualize relationships |
| 7 | Test OLS assumptions |
| 8 | Conduct regression analysis |
| 9 | Interpret results |
My research question is: "What is the relationship between topography and car break-ins in San Francisco?
This analysis will focus on the link between elevation and “hilliness” in determining the likelihood of car break-ins in San Francisco. Both terrain and motor vehicle crimes are ubiquitous when discussing living or visiting San Francisco. This relationship has been coined by the phrase - “Crime Doesn’t Climb”.
In April 2021, Young-An Kim & James C. Wo published Topography and crime in place: The effects of elevation, slope, and betweenness in San Francisco street segments. This study analyzes the effects of “hilliness” on crime, controlling for socio-demographic characteristics Kim and Wo (2021). However, their analysis did not separate by specific crime categories and instead included all types of crime, including violent, nonviolent, property, etc. My analysis will focus only on car break-ins rather than all crime reports, as I believe that these crimes will have a more significant relationship with topography.
Understanding the relationship between topography and car break-ins can motivate local-level policy decisions, such as implementing measures proven to reduce crime in areas identified as having characteristics more susceptible to break-ins.
Table 2 below contains the key measures of interest in our analysis:
data.frame("Dependent Variable" = c("Crime (specifically car break-ins)", ""),
"Independent Variables" = c("Elevation", "Slope"),
"Control Variable" = c("Median Income","")) %>%
kable(col.names = c("Dependent Variable", "Independent Variables", "Control Variable"), caption = "Table 2. Key variables for regression analysis") %>%
kable_paper(full_width = FALSE) %>%
column_spec(1, bold = T) %>%
row_spec(0, bold = T, color = "black")
| Dependent Variable | Independent Variables | Control Variable |
|---|---|---|
| Crime (specifically car break-ins) | Elevation | Median Income |
| Slope |
When discussing topography, both elevation and “hilliness,” or slope, are necessary for inclusion. This is because it more accurately captures the effect of local level topography, which is supported by Kim & Wo.
In any econometric analysis, it is vital to control for socio-economic variables. In this case, it could be that higher elevations in the city are more affluent areas, which may have an impact on crime. Thus, we want to include median income as a control variable.
Here is the regression equation:
\[MotorVehicleTheft_i = \beta_0 + \beta_1Elevation_i + \beta_2Slope_i + \beta_3MedianIncome_i + u_i\]
tidycensus package via the US Census Bureau.The following code chunk demonstrates how the crime, elevation, slope, and income datasets were merged.
# Find index of nearest contour to each crime
elev <- st_nearest_feature(x = crimes, y = contours)
# Add elevation and binary slope columns
crimes <- crimes %>%
st_join(y = census_geom, join = st_within, left = TRUE) %>%
mutate(elev = contours[elev,]$elevation) %>%
rename(median_income = estimate) %>%
select(date_incid, slope, median_income, elev, geometry)
The summary statistics for the data are provided in Table 3 below.
crimes %>%
st_drop_geometry() %>%
select(slope, elev, median_income) %>%
psych::describe(fast=TRUE) %>%
kable(col.names = c("", "Count", "Mean", "SD", "Min", "Max", "Range", "SE"), caption = "Table 3. Summary statistics for Slope, Elevation, and Median Income variables") %>%
kable_paper(full_width = FALSE) %>%
column_spec(1, bold = T) %>%
row_spec(0, bold = T, color = "black") %>%
row_spec(c(1,2), bold = T, background = "#E5E5E5")
| Count | Mean | SD | Min | Max | Range | SE | ||
|---|---|---|---|---|---|---|---|---|
| slope | 1 | 24454 | 4.492966 | 4.793824 | 0 | 68 | 68 | 0.0306554 |
| elev | 2 | 24454 | 129.321379 | 120.439659 | -5 | 780 | 785 | 0.7701841 |
| median_income | 3 | 23764 | 114039.269567 | 47501.416992 | 12340 | 208425 | 196085 | 308.1390882 |
slope_box <- ggplot(data = crimes, aes(x = "", y = slope)) +
geom_boxplot() +
geom_jitter(aes(color = slope),
width = 0.2,
size=0.4,
alpha=0.025,
show.legend = FALSE) +
theme_classic() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank()) +
labs(x = "Slope (percent)")
elev_box <- ggplot(data = crimes, aes(x = "", y = elev)) +
geom_boxplot() +
geom_jitter(aes(color = elev),
width = 0.2,
size=0.4,
alpha=0.025,
show.legend = FALSE) +
theme_classic() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank()) +
labs(x = "Elevation (feet)")
income_box <- ggplot(data = crimes, aes(x = "", y = median_income)) +
geom_boxplot() +
geom_jitter(aes(color = median_income),
width = 0.2,
size=0.4,
alpha=0.025,
show.legend = FALSE) +
theme_classic() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank()) +
labs(x = "Median Income (USD)")
elev_box + slope_box + income_box + plot_annotation(title = 'Boxplots of Independent Variables',
theme = theme(plot.title = element_text(hjust = 0.5)))

The following plots show the simple relationships between count of car break ins and the three independent variables (elevation, slope, and median income).
# Group by income
income_summary <- crimes %>%
st_drop_geometry() %>%
group_by(median_income) %>%
summarize(count = n())
income_plot = ggplot(data = income_summary, aes(x = median_income, y = count)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
labs(title = "Crime and Median Income",
x = "Median Income (USD)",
y = "Number of Break-Ins")
# Group by elevation
elev_summary <- crimes %>%
st_drop_geometry() %>%
group_by(elev) %>%
summarize(count = n())
elev_plot <- ggplot(data = elev_summary, aes(x = elev, y = count)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
labs(title = "Crime and Elevation",
x = "Elevation (feet)",
y = "Number of Break-Ins")
# Group by slope
slope_summary <- crimes %>%
st_drop_geometry() %>%
group_by(slope) %>%
summarize(count = n())
slope_plot <- ggplot(data = slope_summary, aes(x = slope, y = count)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
labs(title = "Crime and Slope",
x = "Slope (percent)",
y = "Number of Break-Ins")
elev_plot + (slope_plot / income_plot)

We see there is a negative correlation between elevation and crime. It is possible that this relationship is not linear, so this variable may fit better if we take the natural log. Since slope in this analysis is a binary variable, we simply see that there are more crimes in areas that are not designated high slope. Lastly, we observe a weak negative correlation between median income and car break-ins.
The following plots show crimes over the time period of our data.
# Group by all three variables
crimes_summary <- crimes %>%
st_drop_geometry() %>%
group_by(slope, median_income, elev) %>%
summarize(count = n())
# Group by date
crimes_ts <- crimes %>%
st_drop_geometry() %>%
group_by(date_incid) %>%
summarize(count = n())
ts_plot <- ggplot(data = crimes_ts, aes(x = date_incid, y = count)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
labs(title = "Daily Crime 2018-Present",
x = "Date (daily)",
y = "Number of Break-Ins")
crimes_monthly <- crimes_ts %>%
mutate(month = lubridate::floor_date(date_incid, "month")) %>%
group_by(month) %>%
summarize(monthly_sum = sum(count)) %>%
filter(month != "2021-11-01")
monthly_plot <- ggplot(data = crimes_monthly, aes(x = month, y = monthly_sum)) +
geom_point() +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
labs(title = "Monthly Crime 2018-Present",
x = "Date (monthly)",
y = "Number of Break-Ins")
ts_plot + monthly_plot

We can see that there is an upward trend to our crime data over the time period of our data.